A house's value is more than location and square footage. Like the traits that make up a person, an informed party wants to know all the aspects that give a house its value. For example, suppose you want to sell a house and do not know what price to ask: it cannot be too low or too high. Currently, to estimate a house price, an interested person has to manually search for similar properties in the neighbourhood and derive an estimate from them, which takes a lot of time. And when another person in the same neighbourhood wants to buy or sell, they have to repeat the whole manual process.
We can build a prediction model that considers the house's age, number of bathrooms and bedrooms, square footage, etc., and predicts the price for a new set of inputs. This will save a lot of time.
We will also examine the relationships between features to identify which parameters influence which.
We will need survey data consisting of house details and house prices. House details can include square footage, number of bedrooms, bathrooms, floors, furnished or not furnished, sight, location, etc.
As part of the capstone project, we were given a dataset of around 21,000 data points. We will use this dataset to build a house price prediction model.
This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
The following columns are present in the dataset:
1. cid: a notation for a house
2. dayhours: Date house was sold
3. price: Price is prediction target
4. room_bed: Number of Bedrooms/House
5. room_bath: Number of bathrooms/House
6. living_measure: square footage of the home
7. lot_measure: square footage of the lot
8. ceil: Total floors (levels) in house
9. coast: House which has a view to a waterfront
10. sight: Has been viewed
11. condition: How good the condition is (Overall)
12. quality: grade given to the housing unit, based on grading system
13. ceil_measure: square footage of house apart from basement
14. basement_measure: square footage of the basement
15. yr_built: Built Year
16. yr_renovated: Year when house was renovated
17. zipcode: zip
18. lat: Latitude coordinate
19. long: Longitude coordinate
20. living_measure15: Living room area in 2015(implies-- some renovations) This might or might not have affected the lotsize area
21. lot_measure15: lotSize area in 2015(implies-- some renovations)
22. furnished: Based on the quality of room
23. total_area: Measure of both living and lot
### Importing necessary modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
sns.set_style("whitegrid")
%matplotlib inline
import itertools
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV,cross_validate,cross_val_score
from sklearn import metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.decomposition import PCA
from matplotlib import style
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
## importing data set
innercity_df = pd.read_csv('innercity.csv')
innercity_df.head()
innercity_df.shape
innercity_df.info()
After checking the dataset in Excel, we found that the dayhours column spans only one year.
The data type of dayhours is object. We may have to convert it, or drop the column since the time frame is only one year.
## Checking for null data
print(innercity_df.isnull().any())
def missing_val(df):
    # report the percentage of missing values per column
    missing = df.isnull().sum()/len(df)*100
    return missing[missing>0]
print(missing_val(innercity_df))
# The dayhours column gives the selling time of the house. As its values span only one year,
# we will keep just the year;
# the year alone is sufficient since the months have little impact.
# Extracting the values of year from dayhours column:
innercity_df['dayhours']= pd.to_datetime(innercity_df['dayhours'])
print(innercity_df.info())
## taking only year part
innercity_df['dayhours'] = pd.DatetimeIndex(innercity_df['dayhours']).year
innercity_df.info()
print(innercity_df['dayhours'].unique())
innercity_df.head()
sns.distplot(innercity_df['price'],kde=True);
The target variable has some positive skewness and deviates from the
normal distribution. Let's take a look at the skewness and kurtosis in numbers:
print("Skewness: %f" % innercity_df['price'].skew())
print("Kurtosis: %f" % innercity_df['price'].kurt())
In the data standardisation section, we will fix this.
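As a quick illustration of why a log transform helps with skew like this, the sketch below uses synthetic lognormal "prices" (a toy example, not the project data) and compares skewness before and after taking the log:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy right-skewed "prices": lognormal, similar in shape to a price column
prices = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=5000))

skew_before = prices.skew()
skew_after = np.log(prices).skew()
print("before:", skew_before, "after:", skew_after)
```

The log compresses the long right tail, pulling the skewness close to zero.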
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
innercity_df.hist(ax = ax)
cols = innercity_df.columns
length = len(cols)
fig = plt.figure(figsize=(13,20))
for i,j in itertools.zip_longest(cols,range(length)):
plt.subplot(6,4,j+1)
ax = sns.distplot(innercity_df[i],kde=True,hist_kws={"linewidth": 1})
ax.set_facecolor("w")
# plt.axvline(innercity_df[i].mean(),linestyle="dashed",label="mean",color="k" )
# plt.legend(loc="best")
plt.title(i,color="navy")
plt.xlabel("")
From the KDE plots we can see that some univariate distributions have multiple peaks, which could indicate clusters.
living_measure, ceil_measure and living_measure15 have somewhat skewed distributions.
innercity_df.groupby(['zipcode']).price.count().sort_values().plot(kind='bar',figsize=(15,6))
plt.title("House count per zipcode")
innercity_df.groupby(['room_bed']).price.count().plot(kind='bar',figsize=(15,6))
plt.title("House count per number of bedrooms")
innercity_df.groupby(['quality']).price.count().plot(kind='bar',figsize=(15,6))
plt.title("House count per quality grade")
innercity_df.groupby(['condition']).price.count().plot(kind='bar',figsize=(15,6))
plt.title("House count per condition rating")
## different features by plotting them to determine the relationship to SalePrice
var = 'room_bed'
data = pd.concat([innercity_df['price'], innercity_df[var]], axis=1)
f, ax = plt.subplots(figsize=(14, 6))
fig = sns.boxplot(x=var, y="price", data=data)
fig.axis(ymin=0, ymax=3500000);
var = 'quality'
data = pd.concat([innercity_df['price'], innercity_df[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="price", data=data)
fig.axis(ymin=0, ymax=3500000);
var = 'room_bath'
data = pd.concat([innercity_df['price'], innercity_df[var]], axis=1)
f, ax = plt.subplots(figsize=(20, 20))
fig = sns.boxplot(x=var, y="price", data=data)
fig.axis(ymin=0, ymax=3500000);
innercity_df['room_bath'].unique()
As the features "room_bed", "quality" and "room_bath" increase, so does the price.
There are outliers for each of these features.
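One way to quantify those outliers is the standard 1.5×IQR rule. The sketch below runs on a small hypothetical price series (values in thousands, not the project data):

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical prices in thousands: a tight cluster plus two extreme values
prices = pd.Series([250, 300, 320, 350, 400, 410, 450, 500, 3500, 4000])
outliers = prices[iqr_outlier_mask(prices)]
print(outliers)
```

Applying the same mask column by column on the dataset would give per-feature outlier counts.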
var = 'yr_built'
data = pd.concat([innercity_df['price'], innercity_df[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="price", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
Generally, the year a house was built should have a roughly linear relationship with its price.
But the data is not showing that.
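A simple way to check this is to bucket yr_built into decades and compare median prices per bucket. The sketch below runs on a synthetic stand-in frame (random years, lognormal prices); on the real innercity_df the same two lines would reveal the actual trend:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for innercity_df with just the two columns we need
df = pd.DataFrame({
    "yr_built": rng.integers(1900, 2016, size=1000),
    "price": rng.lognormal(13, 0.5, size=1000),
})
df["decade"] = (df["yr_built"] // 10) * 10
median_by_decade = df.groupby("decade")["price"].median()
print(median_by_decade)
```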
features = innercity_df.columns.tolist()
features.remove('cid')
features.remove('price')
style.use('fivethirtyeight')
cm = plt.cm.get_cmap('RdYlBu')
z = range(len(innercity_df))  # one colour value per row
for feature in features:
sc = plt.scatter(innercity_df[feature], innercity_df['price'], label = feature, c = z, marker = 'o', s = 30, cmap = cm)
plt.colorbar(sc)
plt.xlabel(''+feature)
plt.ylabel('price')
# plt.yticks([0, 500000, 1000000, 1500000, 2000000, 2500000, 3000000, 3500000, 4000000, 4500000, 5000000, 5500000, 6000000, 6500000, 7000000, 7500000, 8000000],
# ['0', '0.5M', '1M', '1.5M', '2M', '2.5M', '3M', '3.5M', '4M', '4.5M', '5M', '5.5M', '6M', '6.5M', '7M', '7.5M', '8M',])
plt.legend()
plt.show()
- There is a linear relationship of price with ceil_measure, living_measure15 and living_measure
- Prices of furnished houses are higher than those of unfurnished houses, as expected
- As quality increases, the price of the house also increases
- As this data is collected from one particular area, long and lat are concentrated in one particular range
- Clearly, as the number of bathrooms increases, the price also increases
- There is one particular record with 33 bedrooms, which is very unlikely. We may want to drop this outlier because it would affect the model.
- We will verify the above with a heatmap
## Relationship between different independent variables
sns.pairplot(innercity_df.iloc[:,0:10] )
sns.pairplot(innercity_df.iloc[:,10:] )
Some independent variables have a linear correlation,
but most variables have a very weak relation or no relation between them.
We will check the same in the heatmap.
corrmat = innercity_df.corr()
f, ax = plt.subplots(figsize=(25, 15))
sns.heatmap(corrmat, vmax=.8, square=True, annot=True);
- From the heatmap we can also see that ceil_measure, living_measure15 and living_measure have high correlation values with price
- cid has low correlation, and it is anyway just a unique identifier, so we have to ignore it during model building
- as mentioned earlier, dayhours covers only 2014 and 2015, and the heatmap shows a very low correlation with price, so we can try to ignore this column as well
- some pairs of independent variables have high correlation values, for example:
  room_bath and living_measure have a correlation of 0.75;
  room_bath also has a high correlation with ceil_measure;
  quality has a high correlation with living_measure and ceil_measure;
  ceil_measure and living_measure have a correlation of 0.88, which is expected;
  living_measure15 and living_measure are also highly correlated;
- but most variables have very weak correlation, as we saw in the pair plot.
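These high-correlation pairs can also be extracted programmatically rather than read off the heatmap. A sketch of such a helper, demonstrated on a toy frame (column names borrowed from the dataset, values synthetic):

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame: two strongly related columns plus an unrelated one
rng = np.random.default_rng(0)
living = rng.normal(2000, 500, 300)
toy = pd.DataFrame({
    "living_measure": living,
    "ceil_measure": 0.8 * living + rng.normal(0, 50, 300),
    "noise": rng.normal(0, 1, 300),
})
pairs = top_correlated_pairs(toy)
print(pairs)
```

Called on innercity_df, this would list exactly the pairs noted above.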
We will treat zipcode as a categorical variable at a later stage.
## we will plot the lat and long values to see exactly where the houses reside. As we have ~21,000 data points,
# it is not feasible to plot them all; we will plot only a sample of 2,000 records
import folium
import math
median_lat = innercity_df['lat'].median()
median_long = innercity_df['long'].median()
m = folium.Map(location=[median_lat, median_long], zoom_start=12)
# numofhtml = math.ceil(len(innercity_df['lat'])/2000)
# for j in range(numofhtml):
for i in range(2000):
folium.Marker([innercity_df.loc[i, "lat"],innercity_df.loc[i, "long"]],
).add_to(m),
# m.save('geomap_'+str(j)+'.html')
m.save('geomap.html')
m
## applying log to price to make it gaussian distributed
innercity_df['price1']=np.log(innercity_df['price'])
sns.distplot(innercity_df['price1'],kde=True);
The distribution of price was skewed before; after applying the log function it is symmetric.
We will check whether any correlation values improved after applying the log.
corrmat1 = innercity_df.corr()
f, ax = plt.subplots(figsize=(25, 15))
sns.heatmap(corrmat1, vmax=.8, square=True, annot=True);
# print(corrmat['price'])
# print(corrmat1['price']
df = pd.DataFrame({'price correlation before log': corrmat['price'],'price correlation after log':corrmat1['price1']})
df
As there is not much difference in the values before and after applying the log to price,
we will retain price without the log applied.
### remove cid and dayhours as analysed above
innercity_df.drop(['cid','dayhours','price1'],axis=1,inplace=True)
### as mentioned earlier, removing the row with 33 bedrooms to reduce the impact of this extreme outlier
innercity_df[innercity_df['room_bed']==33]
innercity_df.drop(innercity_df[innercity_df['room_bed']==33].index, axis=0,inplace=True)
innercity_df.head()
As the target variable is continuous, we will go for regression modelling.
## Creating data X,y from the dataset,
## Converting zipcode to categorical variables
innercity_df = pd.get_dummies(innercity_df,columns=['zipcode'],drop_first=True)
X = innercity_df.drop('price',axis=1)
Y = innercity_df['price']
X.shape
## dividing the dataset into 3 parts: train, test, validation
x_train_validation,x_test,y_train_validation,y_test = train_test_split(X, Y, test_size=0.3,random_state=3)
x_train,x_validation,y_train,y_validation = train_test_split(x_train_validation, y_train_validation,test_size=0.3, random_state=3)
sc_X = StandardScaler()
X_stnd_train = sc_X.fit_transform(x_train)
X_stnd_test = sc_X.transform(x_test)
X_stnd_validation = sc_X.transform(x_validation)
X_stnd = sc_X.fit_transform(X)
# X_stnd_validation
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_stnd_train, y_train)
accuracy = regressor.score(X_stnd_validation, y_validation)
print("Validation score: {}%".format(int(round(accuracy * 100))))
scores = cross_val_score(regressor,X_stnd,Y,cv = 10)
print("cross val scores: ",scores)
print("cross val scores mean: ",scores.mean())
ypred = regressor.predict(X_stnd_validation)
np.sqrt(mean_squared_error(y_validation,(ypred)))
sns.regplot(x=ypred,y=y_validation)
from sklearn.tree import DecisionTreeRegressor
dt_model1 = DecisionTreeRegressor()
dt_model1.fit(X_stnd_train, y_train)
dt_model1_train_score = dt_model1.score(X_stnd_train, y_train)
print("Training score: ",dt_model1_train_score)
dt_model1_test_score = dt_model1.score(X_stnd_validation, y_validation)
print("Validation score: ",dt_model1_test_score)
scores = cross_val_score(dt_model1,X_stnd,Y,cv = 10)
print(scores)
print(scores.mean())
ypred = dt_model1.predict(X_stnd_validation)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
As expected, the decision tree overfits: the training score is 99% while the validation score is 71%. We will tune the decision tree later.
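The overfitting pattern above can be reproduced on synthetic data: an unpruned DecisionTreeRegressor memorises the training set, while a depth-limited one trades training fit for generalisation. A minimal sketch (synthetic data and illustrative hyperparameters, not the project data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1.0, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until every training point is fit
full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree: smoother fit, smaller train/test gap
pruned = DecisionTreeRegressor(max_depth=5, min_samples_leaf=7, random_state=0).fit(X_tr, y_tr)

print("unpruned train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("depth-5  train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```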
from sklearn import svm
svrmodel1 = svm.SVR(kernel='rbf',degree=8,C=5)
svrmodel1.fit(X_stnd_train, y_train)
svrmodel1_score = svrmodel1.score(X_stnd_validation, y_validation)
print("Validation score: ",svrmodel1_score)
# scores = cross_val_score(svrmodel1,X_stnd,Y,cv = 10)
# scores
# SVR is giving the worst result
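A likely reason SVR performs so poorly here is the scale of the target: with prices in the hundreds of thousands, the default epsilon and a small C leave the model predicting a near-constant value. A sketch on synthetic data (a hypothetical price-like target, not the project data) showing how standardising the target changes the picture:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))
# Target on a price-like scale (hundreds of thousands)
y = 500_000 + 100_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 10_000, 800)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVR fit directly on the raw target: epsilon/C are tiny relative to the scale
raw = SVR(kernel="rbf", C=5).fit(X_tr, y_tr)
raw_score = raw.score(X_te, y_te)

# Same model after standardising the target
sc_y = StandardScaler()
y_tr_s = sc_y.fit_transform(y_tr.reshape(-1, 1)).ravel()
scaled = SVR(kernel="rbf", C=5).fit(X_tr, y_tr_s)
pred = sc_y.inverse_transform(scaled.predict(X_te).reshape(-1, 1)).ravel()
scaled_score = r2_score(y_te, pred)
print("raw target R2:", raw_score, "scaled target R2:", scaled_score)
```

So before discarding SVR entirely, scaling the target (not just the features) would be worth trying.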
### APPLY PCA AND CHECK
from sklearn.decomposition import PCA
pca = PCA().fit(X_stnd_train)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
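The cumulative-variance curve can also be turned into a concrete component count programmatically. The sketch below uses synthetic low-rank data; on the real standardised training matrix the same two lines give the number of components needed for, say, 95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 latent directions embedded in 50 correlated columns
latent = rng.normal(size=(500, 10))
X = latent @ rng.normal(size=(10, 50)) + 0.01 * rng.normal(size=(500, 50))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components reaching 95% of the variance
n_components_95 = int(np.searchsorted(cumvar, 0.95) + 1)
print(n_components_95, cumvar[n_components_95 - 1])
```

scikit-learn also accepts a fraction directly, e.g. PCA(n_components=0.95), which performs the same selection.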
pca_final = PCA(n_components=80).fit(X_stnd_train)
X_pca_train1=pca_final.transform(X_stnd_train)
X_pca_test1=pca_final.transform(X_stnd_test)
X_pca_validation1=pca_final.transform(X_stnd_validation)
regressorPca = LinearRegression()
regressorPca.fit(X_pca_train1, y_train)
accuracy = regressorPca.score(X_pca_validation1, y_validation)
"Accuracy: {}%".format(int(round(accuracy * 100)))
dt_modelpca = DecisionTreeRegressor()
dt_modelpca.fit(X_pca_train1, y_train)
dt_model_pca_train_score = dt_modelpca.score(X_pca_train1, y_train)
print("Training score: ",dt_model_pca_train_score)
dt_model_pca_test_score = dt_modelpca.score(X_pca_validation1, y_validation)
print("Validation score: ",dt_model_pca_test_score)
ypred = dt_modelpca.predict(X_pca_validation1)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
We will continue with the original x_train and x_test.
SVR is also giving the worst results, so we will not proceed with SVR.
We will continue with Linear Regression and DecisionTree.
### Tuning parameters for decision tree
param = {'max_depth': [5,10,15,20],'criterion':['mse','friedman_mse','mae'],'min_samples_leaf': [2,5,7,8,9,10]}
dt_model1 = DecisionTreeRegressor()
dt_grid_model1 = GridSearchCV(dt_model1,param_grid=param)
dt_grid_model1.fit(X_stnd_train,y_train)
dt_grid_model1.best_params_
# dt_grid_model1.best_score_
dt_model2 = DecisionTreeRegressor(criterion='mse',max_depth=10,min_samples_leaf=7)
dt_model2.fit(X_stnd_train, y_train)
dt_model2_train_score = dt_model2.score(X_stnd_train, y_train)
print("Training score: ",dt_model2_train_score)
dt_model2_test_score = dt_model2.score(X_stnd_validation, y_validation)
print("Validation score: ",dt_model2_test_score)
scores = cross_val_score(dt_model2,X_stnd,Y,cv = 10)
print("cross val scores: ",scores)
print("cross val scores mean: ",scores.mean())
ypred = dt_model2.predict(X_stnd_validation)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
features_df = pd.DataFrame(dt_model2.feature_importances_,columns= ['importance'], index=x_train.columns).sort_values('importance')
features_df
from sklearn.linear_model import Lasso
lassoreg = Lasso(normalize=True, max_iter=100000)
param = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]}
dt_grid_model2 = GridSearchCV(lassoreg,param_grid=param)
dt_grid_model2.fit(X_stnd_train,y_train)
dt_grid_model2.best_params_
lassoreg1 = Lasso(normalize=True, max_iter=100000,alpha=0.01)
lassoreg1.fit(X_stnd_train,y_train)
lassoreg1.score(X_stnd_validation, y_validation)
ypred = lassoreg1.predict(X_stnd_validation)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
## Apply Ridge
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=1e-100,normalize=True)
param = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]}
dt_grid_model3 = GridSearchCV(ridgereg,param_grid=param)
dt_grid_model3.fit(X_stnd_train,y_train)
dt_grid_model3.best_params_
ridgereg = Ridge(alpha=0.0001,normalize=True)
ridgereg.fit(X_stnd_train,y_train)
ridgereg.score(X_stnd_validation, y_validation)
ypred = ridgereg.predict(X_stnd_validation)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
from sklearn.ensemble import GradientBoostingRegressor
min_samples_split_val = [int(x) for x in np.arange(2,5)]
min_samples_leaf_val = [int(x) for x in np.arange(2,5)]
max_depth_val = [int(x) for x in np.arange(2,5)]
param = {'n_estimators': [1000,1500,2000], 'max_depth': max_depth_val,'min_samples_split':min_samples_split_val, 'min_samples_leaf':min_samples_leaf_val}
gbcl = GradientBoostingRegressor()
gbcl_grid_model = GridSearchCV(gbcl,param_grid=param,verbose=5,n_jobs=15)
gbcl_grid_model.fit(X_stnd_train, y_train)
print(gbcl_grid_model.best_params_)
print(gbcl_grid_model.best_score_)
gbcl_final = GradientBoostingRegressor(max_depth = 3, min_samples_leaf= 4, min_samples_split= 3, n_estimators= 1000)
gbcl_final.fit(X_stnd_train, y_train)
gbcl_score = gbcl_final.score(X_stnd_validation, y_validation)
print("Validation score: ",gbcl_score)
scores = cross_val_score(gbcl_final,X_stnd,Y,cv = 10)
print("cross val scores: ",scores)
print("cross val scores mean: ",scores.mean())
ypred = gbcl_final.predict(X_stnd_validation)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
### After applying gradient boosting, the accuracy is 90%
from sklearn.ensemble import RandomForestRegressor
min_samples_split_val = [int(x) for x in np.arange(2,6)]
min_samples_leaf_val = [int(x) for x in np.arange(1,6)]
max_depth_val = [int(x) for x in np.arange(1,6)]
param = {'n_estimators': [1000,1500,2000], 'max_depth': max_depth_val,'min_samples_split':min_samples_split_val, 'min_samples_leaf':min_samples_leaf_val}
rf = RandomForestRegressor()
# Train the model on training data
randm_grid_model = GridSearchCV(rf,param_grid=param,verbose=5,n_jobs=15)
randm_grid_model.fit(X_stnd_train, y_train)
print(randm_grid_model.best_params_)
print(randm_grid_model.best_score_)
randm_final_model = RandomForestRegressor(max_depth = 5, min_samples_leaf= 1, min_samples_split= 3, n_estimators= 1000)
randm_final_model.fit(X_stnd_train, y_train)
randmfrst_score = randm_final_model.score(X_stnd_validation, y_validation)
print("Validation score: ",randmfrst_score)
scores = cross_val_score(randm_final_model,X_stnd,Y,cv = 10)
print("cross val scores: ",scores)
print("cross val scores mean: ",scores.mean())
ypred = randm_final_model.predict(X_stnd_validation)
print(np.sqrt(mean_squared_error(y_validation,(ypred))))
sns.regplot(x=ypred,y=y_validation)
gbcl_score = gbcl_final.score(X_stnd_test, y_test)
print("score on test data set for gradient boosting: ",gbcl_score)
ypred = gbcl_final.predict(X_stnd_test)
print("mean squared error on test data set for gradient boosting: ", np.sqrt(mean_squared_error(y_test,(ypred))))
sns.regplot(x=ypred,y=y_test)
We started with the project details, problem statement and dataset description.
EDA was done on the dataset: univariate analysis, bivariate analysis and heatmap analysis.
The cid and dayhours columns were removed during model creation, since the EDA showed they had no relation with the target variable.
For model creation we started with Linear Regression at 80% accuracy and ended with gradient boosting at 89% accuracy.
We tried PCA, but as there were not many linear relationships between the variables, PCA did not help much.
We removed some rows and columns for better modelling.
We also applied regularization techniques (Lasso and Ridge), which take care of feature selection by default.
Detailed analysis is given after each section.
We have data only from 2014 to 2015, so data covering more years might help improve performance.
In future we could do cluster analysis for better results.
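As a pointer for that future work, here is a minimal sketch of clustering houses by location with KMeans (toy lat/long blobs standing in for neighbourhoods, not the project data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy lat/long blobs standing in for three neighbourhoods
centers = np.array([[47.5, -122.3], [47.7, -122.2], [47.3, -122.0]])
points = np.vstack([c + rng.normal(0, 0.02, size=(100, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
sizes = np.bincount(km.labels_)
print(sizes)
```

The resulting cluster label could then be one-hot encoded, like zipcode, and added as a feature.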